variable pair
Paths to Causality: Finding Informative Subgraphs Within Knowledge Graphs for Knowledge-Based Causal Discovery
Susanti, Yuni, Färber, Michael
Inferring causal relationships between variable pairs is crucial for understanding multivariate interactions in complex systems. Knowledge-based causal discovery -- which involves inferring causal relationships by reasoning over the metadata of variables (e.g., names or textual context) -- offers a compelling alternative to traditional methods that rely on observational data. However, existing methods using Large Language Models (LLMs) often produce unstable and inconsistent results, compromising their reliability for causal inference. To address this, we introduce a novel approach that integrates Knowledge Graphs (KGs) with LLMs to enhance knowledge-based causal discovery. Our approach identifies informative metapath-based subgraphs within KGs and further refines the selection of these subgraphs using Learning-to-Rank-based models. The top-ranked subgraphs are then incorporated into zero-shot prompts, improving the effectiveness of LLMs in inferring the causal relationship. Extensive experiments on biomedical and open-domain datasets demonstrate that our method outperforms most baselines by up to 44.4 points in F1 scores, evaluated across diverse LLMs and KGs. Our code and datasets are available on GitHub: https://github.com/susantiyuni/path-to-causality
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > Canada > Ontario > Toronto (0.05)
- Health & Medicine > Therapeutic Area > Oncology (1.00)
- Health & Medicine > Pharmaceuticals & Biotechnology (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)
Exploiting LLMs for Automatic Hypothesis Assessment via a Logit-Based Calibrated Prior
Gong, Yue, Fernandez, Raul Castro
As hypothesis generation becomes increasingly automated, a new bottleneck has emerged: hypothesis assessment. Modern systems can surface thousands of statistical relationships (correlations, trends, causal links) but offer little guidance on which ones are novel, non-trivial, or worthy of expert attention. In this work, we study the complementary problem to hypothesis generation: automatic hypothesis assessment. Specifically, we ask: given a large set of statistical relationships, can we automatically assess which ones are novel and worth further exploration? We focus on correlations, as they are a common entry point in exploratory data analysis and often serve as the basis for forming deeper scientific or causal hypotheses. To support automatic assessment, we propose to leverage the vast knowledge encoded in LLMs' weights to derive a prior distribution over the correlation value of a variable pair. If an LLM's prior expects the observed correlation value, then that correlation is not surprising, and vice versa. We propose the Logit-based Calibrated Prior, an LLM-elicited correlation prior that transforms the model's raw output logits into a calibrated, continuous predictive distribution over correlation values. We evaluate the prior on a benchmark of 2,096 real-world variable pairs, where it achieves a sign accuracy of 78.8%, a mean absolute error of 0.26, and 95% credible interval coverage of 89.2% in predicting the Pearson correlation coefficient. It also outperforms a fine-tuned RoBERTa classifier in binary correlation prediction and achieves higher precision@K in hypothesis ranking. We further show that the prior generalizes to correlations not seen during LLM pretraining, reflecting context-sensitive reasoning rather than memorization.
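The three evaluation metrics named in this abstract (sign accuracy, mean absolute error, and credible interval coverage) can be illustrated with a minimal sketch. The function name `evaluate_prior`, the tuple layout, and the toy numbers are assumptions made for illustration; they are not taken from the paper, and treating a value of exactly zero as non-negative is a simplification chosen here.

```python
def evaluate_prior(predictions, observed):
    """Score predicted correlation priors against observed correlations.

    predictions: list of (mean, lower, upper) tuples, where lower/upper
                 bound a credible interval (hypothetical layout).
    observed:    list of observed Pearson correlation values.
    """
    n = len(observed)
    # Sign accuracy: predicted mean and observed value fall on the same side of zero.
    sign_hits = sum(
        1 for (mean, _, _), r in zip(predictions, observed) if (mean >= 0) == (r >= 0)
    )
    # Mean absolute error between predicted mean and observed value.
    abs_err = sum(abs(mean - r) for (mean, _, _), r in zip(predictions, observed))
    # Coverage: fraction of observed values that fall inside the interval.
    covered = sum(1 for (_, lo, hi), r in zip(predictions, observed) if lo <= r <= hi)
    return {
        "sign_accuracy": sign_hits / n,
        "mae": abs_err / n,
        "coverage": covered / n,
    }


# Toy data, invented purely for illustration.
toy_predictions = [(0.5, 0.1, 0.9), (-0.2, -0.6, 0.2), (0.3, 0.0, 0.6), (0.1, -0.3, 0.5)]
toy_observed = [0.7, -0.4, -0.1, 0.6]
toy_scores = evaluate_prior(toy_predictions, toy_observed)
```

On this toy data, three of four signs agree and two of four observed values fall inside their intervals, so a well-calibrated 95% interval would be expected to cover far more often than this example does.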
- North America > United States > Illinois > Cook County > Chicago (0.05)
- North America > United States > Oregon > Multnomah County > Portland (0.04)
- North America > United States > Iowa > Story County > Ames (0.04)
- Europe > Germany (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.49)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
Re-examining Granger Causality from Causal Bayesian Networks Perspective
The emergence of machine learning (ML) has been phenomenal, with ML-based models outperforming human intelligence, as in the case of AlphaGo [1] and, more recently, large language models (LLMs). With these advances, ML became state-of-the-art for scientific discovery in various fields of study [2]. However, ML algorithms cannot answer the crucial question of "what" brings about an effect, nor "what if" questions; that is, ML cannot identify causal relationships in data or answer counterfactual questions. Hence the need for causality and causal inference, a field that focuses on unravelling causal interactions in data. Characterising these interactions in complex dynamical systems is a fundamental question in science [3]. Causal structure learning (CSL)--a computational causal discovery field, taking advantage of statistics and machine learning (ML) to unravel causal relations in data--is particularly appealing because it enables us to answer counterfactual questions [4, 5, 6, 7]. We adopt Pearl's causality framework.
- Europe > Austria > Vienna (0.14)
- North America > United States > California (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Italy > Marche > Ancona Province > Ancona (0.04)
- Health & Medicine > Therapeutic Area > Neurology (0.46)
- Leisure & Entertainment > Games > Go (0.34)
A Natural Language Processing Approach to Support Biomedical Data Harmonization: Leveraging Large Language Models
Li, Zexu, Prabhu, Suraj P., Popp, Zachary T., Jain, Shubhi S., Balakundi, Vijetha, Ang, Ting Fang Alvin, Au, Rhoda, Chen, Jinying
Biomedical research requires large, diverse samples to produce unbiased results. Automated methods for matching variables across datasets can accelerate this process. Research in this area has been limited, primarily focusing on lexical matching and ontology-based semantic matching. We aimed to develop new methods, leveraging large language models (LLMs) and ensemble learning, to automate variable matching. Methods: We utilized data from two GERAS cohort studies (European and Japanese) to develop variable matching methods. We first manually created a dataset by matching 352 EU variables with 1322 candidate JP variables, where matched variable pairs were positive and unmatched pairs were negative instances. Using this dataset, we developed and evaluated two types of natural language processing (NLP) methods, which matched variables based on variable labels and definitions from data dictionaries: (1) LLM-based and (2) fuzzy matching. We then developed an ensemble-learning method, using the Random Forest (RF) model, to integrate individual NLP methods. RF was trained and evaluated over 50 trials. Each trial had a random split (4:1) of training and test sets, with the model's hyperparameters optimized through cross-validation on the training set. For each EU variable, 1322 candidate JP variables were ranked based on NLP-derived similarity scores or RF's probability scores, denoting their likelihood to match the EU variable. Ranking performance was measured by top-n hit ratio (HR-n) and mean reciprocal rank (MRR). Results: E5 performed best among individual methods, achieving 0.90 HR-30 and 0.70 MRR. RF performed better than E5 on all metrics over 50 trials (P < 0.001) and achieved an average HR-30 of 0.98 and MRR of 0.73. LLM-derived features contributed most to RF's performance. One major cause of errors in automatic variable matching was ambiguous variable definitions within data dictionaries.
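The two ranking metrics used in this abstract, top-n hit ratio and mean reciprocal rank, can be sketched as follows. This is a minimal illustration that assumes each source variable has exactly one correct match (the abstract does not say whether that holds); the function names, variable names, and toy rankings are hypothetical.

```python
def hit_ratio_at_n(ranked_lists, gold, n):
    """Fraction of queries whose correct match appears in the top n candidates."""
    hits = sum(1 for query, ranked in ranked_lists.items() if gold[query] in ranked[:n])
    return hits / len(ranked_lists)


def mean_reciprocal_rank(ranked_lists, gold):
    """Average of 1/rank of the correct match; a missing match contributes 0."""
    total = 0.0
    for query, ranked in ranked_lists.items():
        if gold[query] in ranked:
            total += 1.0 / (ranked.index(gold[query]) + 1)
    return total / len(ranked_lists)


# Toy rankings, invented for illustration: candidates sorted best-first.
toy_rankings = {
    "eu_age": ["jp_age", "jp_sex", "jp_bmi"],
    "eu_sex": ["jp_bmi", "jp_sex", "jp_age"],
    "eu_bmi": ["jp_age", "jp_sex", "jp_bmi"],
}
toy_gold = {"eu_age": "jp_age", "eu_sex": "jp_sex", "eu_bmi": "jp_bmi"}

toy_hr1 = hit_ratio_at_n(toy_rankings, toy_gold, 1)
toy_hr2 = hit_ratio_at_n(toy_rankings, toy_gold, 2)
toy_mrr = mean_reciprocal_rank(toy_rankings, toy_gold)
```

Here the correct matches sit at ranks 1, 2, and 3, so the hit ratio rises with n while MRR averages the reciprocals 1, 1/2, and 1/3.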
- Asia > Japan (0.25)
- North America > United States > Massachusetts > Suffolk County > Boston (0.05)
- Oceania > Australia > Victoria > Melbourne (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Health & Medicine > Therapeutic Area > Neurology > Alzheimer's Disease (0.70)
- Health & Medicine > Epidemiology (0.68)
Use of Prior Knowledge to Discover Causal Additive Models with Unobserved Variables and its Application to Time Series Data
Maeda, Takashi Nicholas, Shimizu, Shohei
This paper proposes two methods for causal additive models with unobserved variables (CAM-UV). CAM-UV assumes that the causal functions take the form of generalized additive models and that latent confounders are present. First, we propose a method that leverages prior knowledge for efficient causal discovery. Then, we propose an extension of this method for inferring causality in time series data. The original CAM-UV algorithm differs from other existing causal function models in that it does not seek the causal order between observed variables, but rather aims to identify the causes for each observed variable. Therefore, the first proposed method in this paper utilizes prior knowledge, such as understanding that certain variables cannot be causes of specific others. Moreover, by incorporating the prior knowledge that causes precede their effects in time, we extend the first algorithm to the second method for causal discovery in time series data. We validate the first proposed method by using simulated data to demonstrate that the accuracy of causal discovery increases as more prior knowledge is accumulated. Additionally, we test the second proposed method by comparing it with existing time series causal discovery methods, using both simulated data and real-world data.
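The two kinds of prior knowledge described in this abstract (forbidden cause-effect pairs, and causes preceding effects in time) can be pictured as pruning the set of candidate causes before any search runs. The sketch below is not the CAM-UV algorithm; it only shows the pruning step under stated assumptions. The function `candidate_parents`, the `(source, lag)` encoding, and the restriction to lags of at least 1 (ignoring same-time effects) are all simplifications introduced here.

```python
from itertools import product


def candidate_parents(variables, max_lag, forbidden=frozenset()):
    """Return {target: [(source, lag), ...]} for each target observed at time t.

    A pair (source, lag) stands for the source variable at time t - lag.
    Only lags >= 1 are generated, which encodes "causes precede effects";
    same-time effects are deliberately left out of this sketch.
    forbidden holds (source, target) pairs ruled out by prior knowledge.
    """
    candidates = {}
    for target in variables:
        allowed = []
        for source, lag in product(variables, range(1, max_lag + 1)):
            if (source, target) in forbidden:
                continue  # prior knowledge: source cannot cause target
            allowed.append((source, lag))
        candidates[target] = allowed
    return candidates


# Toy example: prior knowledge says y can never be a cause of x.
toy_candidates = candidate_parents(["x", "y"], max_lag=1, forbidden={("y", "x")})
```

In the toy example, x keeps only its own past as a candidate cause, while y keeps both lagged variables. Each forbidden pair shrinks the space a discovery method has to search.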
Deep Learning of Causal Structures in High Dimensions
Lagemann, Kai, Lagemann, Christian, Taschler, Bernd, Mukherjee, Sach
Recent years have seen rapid progress at the intersection between causality and machine learning. Motivated by scientific applications involving high-dimensional data, in particular in biomedicine, we propose a deep neural architecture for learning causal relationships between variables from a combination of empirical data and prior causal knowledge. We combine convolutional and graph neural networks within a causal risk framework to provide a flexible and scalable approach. Empirical results include linear and nonlinear simulations (where the underlying causal structures are known and can be directly compared against), as well as a real biological example where the models are applied to high-dimensional molecular data and their output compared against entirely unseen validation experiments. These results demonstrate the feasibility of using deep learning approaches to learn causal networks in large-scale problems spanning thousands of variables.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
VAEM: a Deep Generative Model for Heterogeneous Mixed Type Data
Ma, Chao, Tschiatschek, Sebastian, Hernández-Lobato, José Miguel, Turner, Richard, Zhang, Cheng
Deep generative models often perform poorly in real-world applications due to the heterogeneity of natural data sets. Heterogeneity arises from data containing different types of features (categorical, ordinal, continuous, etc.) and features of the same type having different marginal distributions. We propose an extension of variational autoencoders (VAEs) called VAEM to handle such heterogeneous data. VAEM is a deep generative model that is trained in a two-stage manner such that the first stage provides a more uniform representation of the data to the second stage, thereby sidestepping the problems caused by heterogeneous data. We provide extensions of VAEM to handle partially observed data, and demonstrate its performance in data generation, missing data prediction and sequential feature selection tasks. Our results show that VAEM broadens the range of real-world applications where deep generative models can be successfully deployed.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- Europe > Austria > Vienna (0.04)
Generating Difficult SAT Instances by Preventing Triangles
Escamocher, Guillaume, O'Sullivan, Barry, Prestwich, Steven David
When creating benchmarks for SAT solvers, we need SAT instances that are easy to build but hard to solve. A recent development in the search for such methods has led to the Balanced SAT algorithm, which can create k-SAT instances with m clauses of high difficulty, for arbitrary k and m. In this paper we introduce the No-Triangle SAT algorithm, a SAT instance generator based on the cluster coefficient graph statistic. We empirically compare the two algorithms by fixing the arity and the number of variables, but varying the number of clauses. The hardest instances that we find are produced by No-Triangle SAT. Furthermore, difficult instances from No-Triangle SAT have a different number of clauses than difficult instances from Balanced SAT, potentially allowing a combination of the two methods to find hard SAT instances for a larger array of parameters.
- North America > United States > New York (0.04)
- Europe > Finland > Uusimaa > Helsinki (0.04)
- Oceania > Australia > New South Wales > Sydney (0.04)
Causal Discovery as Semi-Supervised Learning
Oates, Chris J., Mukherjee, Sach
In this short report, we discuss an approach to estimating causal graphs in which indicators of causal influence between variables are treated as labels in a machine learning formulation. Available data on the variables of interest are used as "inputs" to estimate the labels. We frame the problem as one of semi-supervised learning: available interventional data or background knowledge provide labels on some edges in the graph and the remaining edges are treated as unlabelled objects. To illustrate the key ideas, we consider a simple approach to feature construction (rooted in bivariate kernel density estimation) and embed this within a semi-supervised manifold framework. Results on yeast knockout data demonstrate that the proposed approach can identify causal relationships as validated by unseen interventional experiments. An advantage of the formulation we propose is that by reframing causal discovery as semi-supervised learning, it allows a range of data-driven approaches to be brought to bear on causal discovery, without demanding specification of full probability models or explicit models of underlying mechanisms.
- North America (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Learning to Order BDD Variables in Verification
Grumberg, O., Livne, S., Markovitch, S.
The size and complexity of software and hardware systems have significantly increased in the past years. As a result, it is harder to guarantee their correct behavior. One of the most successful methods for automated verification of finite-state systems is model checking. Most of the current model-checking systems use binary decision diagrams (BDDs) for the representation of the tested model and in the verification process of its properties. Generally, BDDs allow a canonical compact representation of a boolean function (given an order of its variables). The more compact the BDD is, the better performance one gets from the verifier. However, finding an optimal order for a BDD is an NP-complete problem. Therefore, several heuristic methods based on expert knowledge have been developed for variable ordering. We propose an alternative approach in which the variable ordering algorithm gains 'ordering experience' from training models and uses the learned knowledge for finding good orders. Our methodology is based on offline learning of pair precedence classifiers from training models, that is, learning which variable pair permutation is more likely to lead to a good order. For each training model, a number of training sequences are evaluated. Every training model variable pair permutation is then tagged based on its performance on the evaluated orders. The tagged permutations are then passed through a feature extractor and are given as examples to a classifier creation algorithm. Given a model for which an order is requested, the ordering algorithm consults each precedence classifier and constructs a pair precedence table which is used to create the order. Our algorithm was integrated with SMV, which is one of the most widely used verification systems. Preliminary empirical evaluation of our methodology, using real benchmark models, shows performance that is better than random ordering and is competitive with existing algorithms that use expert knowledge. 
We believe that in sub-domains of models (alu, caches, etc.) our system will prove even more valuable. This is because it features the ability to learn sub-domain knowledge, something that no other ordering algorithm does.
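The idea of consulting pairwise precedence classifiers, filling a pair precedence table, and deriving a variable order from it can be sketched as below. The abstract does not say how the table is turned into an order, so this sketch swaps in a simple win-count rule (each variable is placed by how many others it should precede). The function names, the toy `priority` scores standing in for a learned classifier, and the signal names are all hypothetical.

```python
from itertools import combinations


def build_precedence_table(variables, prefers_first):
    """Fill table[(a, b)] = True when the classifier says a should precede b."""
    table = {}
    for a, b in combinations(variables, 2):
        a_first = prefers_first(a, b)
        table[(a, b)] = a_first
        table[(b, a)] = not a_first
    return table


def order_from_table(variables, table):
    """Order variables by how many others they should precede (win count).

    Ties are broken alphabetically so the result is deterministic.
    """
    wins = {v: sum(table[(v, u)] for u in variables if u != v) for v in variables}
    return sorted(variables, key=lambda v: (-wins[v], v))


# Stand-in for a learned pair precedence classifier (illustrative only).
priority = {"clk": 0, "req": 1, "ack": 2}


def toy_classifier(a, b):
    return priority[a] < priority[b]


toy_vars = ["req", "ack", "clk"]
toy_table = build_precedence_table(toy_vars, toy_classifier)
toy_order = order_from_table(toy_vars, toy_table)
```

With a consistent classifier the win counts give a strict order. Real classifiers can disagree with each other and produce cycles, which is where the choice of aggregation rule starts to matter.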
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > New York (0.04)
- North America > United States > California > Santa Clara County > Santa Clara (0.04)